Out of the 24 counties in the state of Maryland, Montgomery County is the most populous. It boasts the 20th highest household income in the country, and interestingly, the highest percentage of residents over the age of 25 who hold post-graduate degrees (Montgomery County, Maryland). High household income and education indicate that Montgomery County could be a good place to live - but they aren't the only things to consider.
When evaluating a place to work/live/visit, safety is paramount, and one of the largest components of safety is crime. Montgomery County is a large geographical area, covering 507 square miles with a population of over 1 million residents, and therefore it is not sufficient to simply look at overall crime rate statistics to determine if it's a safe place to live. We asked ourselves the following questions:
In order to answer these questions, we embarked on a data analysis journey to determining whether Montgomery County is a "safe" place to live, hoping to find evidence that crime was trending downward, the police department was effective at containing and addressing crime, and if there are any demographic factors that contribute to different patterns within the county.
We produced 6 analyses to answer these questions - our project should be graded more heavily on data analysis.
First, we will need to install and import all required libraries.
# install ipympl for animated visualization
#! conda install -c conda-forge ipympl
# install wordcloud
#! conda install -c conda-forge wordcloud
# install folium for animated map visualization
#! conda install -c conda-forge folium
# Import core packages
import pandas as pd
import numpy as np
import re
import datetime as dt
from matplotlib import pyplot as plt
import seaborn as sns
import requests
import matplotlib
# Import packages for animation
from matplotlib import animation
from matplotlib.animation import FuncAnimation
from IPython.display import HTML, Javascript
# Import packages for map visualizations
import folium
from folium import plugins
from folium.plugins import MarkerCluster
from folium.plugins import HeatMapWithTime
#Importing Mandatory Libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Import packages for panel visualization
import panel as pn
pn.extension('tabulator')
import hvplot.pandas
import holoviews as hv
from bokeh.models import HoverTool
from io import BytesIO
import base64
# Import wordcloud library
from wordcloud import WordCloud
%matplotlib widget
# set some notebook parameters
pd.options.mode.chained_assignment = None
The Montgomery County government has a comprehensive collection of datasets that report on county demographics, businesses, public safety, and more. From that, we were able to find a dataset that had the potential to answer our questions in the form of a csv file that we can download from their website.
This dataset is updated daily with new crime reports, and contains data that spans from 2017 to the present. It contains 282,000 rows, and has 30 columns containing detailed information about each incident - including start time, number of victims, location, and type of crime. From this plethora of data we will be able to drill down to the most important variables and indicators that we will use to answer our questions.
We downloaded the final version that is used in our analysis on November 08, 2023.
# Load Montgomery dataset
montgomery_crime_df = pd.read_csv('Montgomery_Crime.csv', index_col = ['Incident ID'], dtype={'PRA': str})
The first step of this process is to examine our dataset and understand its main characteristics, which will help us formulate some initial assumptions. This will help us identify any obvious errors, identify patterns, detect outliers, and lay the foundation for the data cleaning tasks that will need to be done.
# Inspect first 5 observations
montgomery_crime_df.head(5)
| Offence Code | CR Number | Dispatch Date / Time | Start_Date_Time | End_Date_Time | NIBRS Code | Victims | Crime Name1 | Crime Name2 | Crime Name3 | ... | Location | Council Districts | Councils | Communities | Zip Codes | Municipalities | Council Districts_from_i23j_3mj8 | Service Regions | Montgomery County Boundary | Council Districts 7 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Incident ID | |||||||||||||||||||||
| 201452140 | 1205 | 230065034 | 11/08/2023 03:20:12 AM | 11/08/2023 03:20:00 AM | NaN | 120 | 1 | Crime Against Property | Robbery | ROBBERY - STREET-OTHER WEAPON | ... | (38.9918, -77.0241) | 5.0 | 5.0 | 7.0 | 3065.0 | 1.0 | 5.0 | 1.0 | 1.0 | 7.0 |
| 201452138 | 2203 | 230065033 | 11/08/2023 02:43:30 AM | 11/08/2023 02:30:00 AM | 11/08/2023 02:42:00 AM | 220 | 1 | Crime Against Property | Burglary/Breaking and Entering | BURGLARY - FORCED ENTRY-NONRESIDENTIAL | ... | (39.055, -77.1183) | 3.0 | 3.0 | 19.0 | 60.0 | 1.0 | 3.0 | 3.0 | 1.0 | 7.0 |
| 201452136 | 1305 | 230065028 | 11/08/2023 01:08:41 AM | 11/08/2023 12:08:00 AM | 11/08/2023 01:50:00 AM | 13A | 1 | Crime Against Person | Aggravated Assault | ASSAULT - AGGRAVATED - NON-FAMILY-OTHER WEAPON | ... | (39.0777, -77.14) | 3.0 | 3.0 | 25.0 | 60.0 | 12.0 | 3.0 | 3.0 | 1.0 | 4.0 |
| 201452119 | 2601 | 230065015 | 11/07/2023 09:15:52 PM | 11/07/2023 09:15:00 PM | NaN | 26A | 1 | Crime Against Property | False Pretenses/Swindle/Confidence Game | FRAUD - CONFIDENCE GAME | ... | (39.0834, -77.0652) | 4.0 | 4.0 | 59.0 | 6.0 | 1.0 | 4.0 | 4.0 | 1.0 | 2.0 |
| 201452132 | 1304 | 230065019 | 11/07/2023 09:06:46 PM | 11/07/2023 09:06:00 PM | 11/08/2023 12:00:00 AM | 13A | 5 | Crime Against Person | Aggravated Assault | ASSAULT - AGGRAVATED - NON-FAMILY-GUN | ... | (39.1418, -77.2244) | 3.0 | 3.0 | 36.0 | 33.0 | 21.0 | 3.0 | 5.0 | 1.0 | 4.0 |
5 rows × 38 columns
# Calculate descriptive statistics about numerical variables in dataset
montgomery_crime_df.describe()
| Offence Code | CR Number | Victims | Zip Code | Address Number | Latitude | Longitude | Council Districts | Councils | Communities | Zip Codes | Municipalities | Council Districts_from_i23j_3mj8 | Service Regions | Montgomery County Boundary | Council Districts 7 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 279992.000000 | 2.799920e+05 | 279992.000000 | 277598.000000 | 2.533180e+05 | 279992.000000 | 279992.000000 | 276220.000000 | 276220.000000 | 273817.000000 | 277334.000000 | 275385.000000 | 276220.000000 | 274351.000000 | 274100.0 | 276237.000000 |
| mean | 2802.817538 | 1.785562e+08 | 1.028054 | 20877.247509 | 8.530961e+03 | 38.709476 | -76.380807 | 3.246278 | 3.246278 | 27.601161 | 421.351381 | 4.186887 | 3.246278 | 3.437713 | 1.0 | 4.069082 |
| std | 1193.342610 | 6.111787e+07 | 0.209805 | 163.822120 | 1.799520e+04 | 3.790282 | 7.478188 | 1.387858 | 1.387858 | 16.085272 | 995.212405 | 6.534121 | 1.387858 | 1.413078 | 0.0 | 2.018775 |
| min | 901.000000 | 1.004668e+07 | 1.000000 | 6.000000 | 0.000000e+00 | 0.000000 | -77.516700 | 1.000000 | 1.000000 | 1.000000 | 6.000000 | 1.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 |
| 25% | 2303.000000 | 1.800087e+08 | 1.000000 | 20854.000000 | 1.700000e+03 | 39.021810 | -77.198300 | 2.000000 | 2.000000 | 16.000000 | 21.000000 | 1.000000 | 2.000000 | 2.000000 | 1.0 | 2.000000 |
| 50% | 2399.000000 | 1.900553e+08 | 1.000000 | 20878.000000 | 8.200000e+03 | 39.072190 | -77.098400 | 3.000000 | 3.000000 | 25.000000 | 39.000000 | 1.000000 | 3.000000 | 4.000000 | 1.0 | 4.000000 |
| 75% | 2999.000000 | 2.100518e+08 | 1.000000 | 20904.000000 | 1.240000e+04 | 39.142190 | -77.029700 | 5.000000 | 5.000000 | 39.000000 | 71.000000 | 1.000000 | 5.000000 | 5.000000 | 1.0 | 6.000000 |
| max | 9103.000000 | 2.301015e+08 | 22.000000 | 29878.000000 | 2.090600e+06 | 39.342600 | 0.000000 | 5.000000 | 5.000000 | 62.000000 | 3065.000000 | 24.000000 | 5.000000 | 5.000000 | 1.0 | 7.000000 |
# Display shape of dataset
df_shape = montgomery_crime_df.shape
df_shape
(279992, 38)
# Identify column types
df_column_types = montgomery_crime_df.dtypes
# df_column_types
# Identify columns with missing values
missing_columns = montgomery_crime_df.columns[montgomery_crime_df.isnull().any()]
print("Columns with missing values:")
print(montgomery_crime_df[missing_columns].isnull().sum())
Columns with missing values: Dispatch Date / Time 40628 End_Date_Time 139934 Police District Name 476 Block Address 26773 City 13 State 5226 Zip Code 2394 PRA 6 Address Number 26674 Street Prefix 267418 Street Name 477 Street Suffix 275349 Street Type 785 Council Districts 3772 Councils 3772 Communities 6175 Zip Codes 2658 Municipalities 4607 Council Districts_from_i23j_3mj8 3772 Service Regions 5641 Montgomery County Boundary 5892 Council Districts 7 3755 dtype: int64
We want to analyze the different types of crimes that are committed, and potentially do further categorization, so we first need to determine the existing categorization provided by this dataset.
# Display value counts for crime types to gather initial information about segments
montgomery_crime_df['Crime Name1'].value_counts()
Crime Name1 Crime Against Property 175277 Crime Against Society 67030 Crime Against Person 37685 Name: count, dtype: int64
# Display value counts for sub-category of crime type
crime_counts = montgomery_crime_df.groupby(['Crime Name1', 'Crime Name2']).size()
crime_counts
Crime Name1 Crime Name2
Crime Against Person Aggravated Assault 5428
Forcible Fondling 2158
Forcible Rape 1709
Forcible Sodomy 730
Human Trafficking, Commercial Sex Acts 66
Human Trafficking, Involuntary Servitude 2
Incest 7
Intimidation 495
Justifiable Homicide 14
Kidnapping/Abduction 96
Murder and Nonnegligent Manslaughter 153
Negligent Manslaughter 2
Purchasing Prostitution 97
Sexual Assault With An Object 506
Simple Assault 26114
Statuory Rape 108
Crime Against Property All other Larceny 17028
Arson 440
Bad Checks 362
Bribery 2
Burglary/Breaking and Entering 10670
Counterfeiting/Forgery 3756
Credit Card/Automatic Teller Machine Fraud 6137
Destruction/Damage/Vandalism of Property 23503
Embezzlement 562
Extortion/Blackmail 548
False Pretenses/Swindle/Confidence Game 8143
From Coin/Operated Machine or Device 72
Identity Theft 10357
Impersonation 1665
Motor Vehicle Theft 10178
Pocket/picking 878
Purse-snatching 508
Robbery 4344
Shoplifting 21444
Stolen Property Offenses 263
Theft From Motor Vehicle 31259
Theft from Building 13527
Theft of Motor Vehicle Parts or Accessories 9251
Welfare Fraud 43
Wire Fraud 337
Crime Against Society Animal Cruelty 94
Assisting or Promoting Prostitution 22
Curfew/Loitering/Vagrancy Violations 38
Disorderly Conduct 6367
Driving Under the Influence 16797
Drug Equipment Violations 2314
Drug/Narcotic Violations 25234
Drunkenness 117
Family Offenses, NonViolent 923
Liquor Law Violations 6999
Operating/Promoting/Assisting Gambling 1
Peeping Tom 134
Pornography/Obscene Material 385
Prostitution 206
Trespass of Real Property 4467
Weapon Law Violations 2932
dtype: int64
# Display value counts for final sub-category of crime type
crime_sub_category = montgomery_crime_df.groupby(['Crime Name1', 'Crime Name2', 'Crime Name3']).size()
crime_sub_category
Crime Name1 Crime Name2 Crime Name3
Crime Against Person Aggravated Assault ASSAULT - AGGRAVATED - FAMILY-GUN 108
ASSAULT - AGGRAVATED - FAMILY-OTHER WEAPON 853
ASSAULT - AGGRAVATED - FAMILY-STRONG-ARM 763
ASSAULT - AGGRAVATED - GUN 666
ASSAULT - AGGRAVATED - NON-FAMILY-GUN 284
...
Crime Against Society Weapon Law Violations WEAPON - INCENDIARY DEVICE - USING 4
WEAPON - POSSESSION 1275
WEAPON - SELLING 3
WEAPON OFFENSE (DESCRIBE OFFENSE) 218
WEAPON TRAFFICKING 4
Length: 268, dtype: int64
We have several columns with thousands of observations that are null. There is also the potential that our data may be corrupted, improperly formatted, duplicated, or incomplete. In order to rectify these issues, we will need to clean this dataset prior to performing any analyses, as inaccurate data can lead to unreliable algorithms/results.
Before we proceed with handling the null data, we should determine if there are any duplicates. We will do this by looking specifically at the Incident ID - duplicates may have resulted from errors in data entry, data extraction, or other data collection processes.
# Identify duplicate records based on the Incident ID
duplicate_records = montgomery_crime_df.index.duplicated()
#montgomery_crime_df[duplicate_records]
# Display the number of duplicate records
print('Number of duplicate records: ', montgomery_crime_df.index.duplicated().sum())
Number of duplicate records: 25371
# Remove duplicated records, keeping the first instance because df is sorted in descending order by time
montgomery_crime_df = montgomery_crime_df[~montgomery_crime_df.index.duplicated(keep='first')]
# montgomery_crime_df
We will remove several columns from our dataset for the following reasons.
Police District Number: The information about the police district is already captured in the 'Police District Name' column. Including both columns is redundant, and the 'Police District Name' provides sufficient information for analysis.
Zip Codes: The 'Zip Code' column already contains information about the location. Including both 'Zip Code' and 'Zip Codes' is redundant, and 'Zip Code' is more relevant for location-based analysis.
Municipalities: The dataset already includes the 'City' column, which represents the municipality or city associated with each incident. Including 'Municipalities' is redundant and does not provide additional meaningful information.
Council Districts_from_i23j_3mj8: The purpose and origin of this column are not explicitly defined, and its inclusion does not contribute to the analysis. It lacks clear relevance or significance for the crime-related insights sought from the dataset.
Service Regions: The role and significance of 'Service Regions' in the context of crime analysis are not apparent. Without a clear understanding of its relevance, this column is excluded from the dataset.
Council Districts 7: The dataset already contains 'Council Districts,' and the additional 'Council Districts 7' column lacks justification or documentation for its inclusion. It is prudent to rely on the more standard 'Council Districts' column for analysis.
# Drop unnecessary columns
montgomery_crime_df.drop(['Police District Number','Zip Codes', 'Municipalities', 'Council Districts_from_i23j_3mj8', 'Service Regions', 'Council Districts 7'], axis=1, inplace=True)
# montgomery_crime_df
There are multiple cities that are mispelled in our dataset. Because we plan to use this column in our analysis, we need to figure out how to fix the incorrect values.
# Replace the NaN values with an empty string for consistency
montgomery_crime_df['City'] = montgomery_crime_df['City'].fillna('')
# Define a mapping for corrections
city_corrections = {
'SILVER SPRING': ['SILVER SPRIN G', 'SILVERS SPRING', 'SILER SPRING', 'SILVE SPRING', 'SILVERSPRING', 'SILVER APRING', 'SILVER SPRIG', 'SILVER SPRING`', 'SLVER SPRING', 'SLIVER SPRING', 'SILV ER SPRING', 'SILVE4R SPRING', 'SILVER SPRING`', 'SILVER SPRNIG', 'SILVER SPRIING', 'SILVER SPRIND', 'SILVER SPRIGN'],
'ROCKVILLE': ['ROCKVIILE', 'ROCKVIL', 'ROCKVIILE', 'ROCKVILEE', 'ROCKVILLE,', 'ROCKIVLLE', 'ROCKVILLE,', 'ROCKVIILLE', 'ROCVILLE', 'ROCVKILLE', 'ROCKVIILE'],
'GAITHERSBURG': ['GAITHERBURG', 'GAITHERBURG', 'GAITHERSBUG', 'GAITHERSBUIRG', 'GAITHERSBURT', 'GAITHERSBRUG', 'GAITHERSGURG', 'GAITHESRBURG', 'GAITERSBURG', 'GAITHERSRBURG', 'GAIHTERSBURG', 'GAUTHERSBURG', 'GAIHTERSBURG', 'GAIHERSBURG', 'GATIHERSBURG', 'GAITHERBSURG', 'GAITHERSBRG', 'GAITHERSBRG', 'GAITHERSURG', 'GAITHRESBURG', 'GAISTHERSBURG', 'GAITHERSURG', 'GITHERSBURG', 'GITHERSBYRG'],
'BETHESDA': ['BEHTESDA', 'BEHESDA', 'BETHEDA', 'BETHESDAS', 'BETHESDAS', 'BETHSDA', 'BETESDA', 'BETHESDA'],
'TAKOMA PARK': ['TACOMA PARK', 'TAKOMA PARK'],
# Add more entries as needed
}
city_corrections_manual = {
'BETHESDA' : ['BEHTESDA', 'BEHESDA', 'BETESDA', 'BETHEDA', 'BETHESA', 'BETHESDAS', 'BETHSDA'],
'Beallsville': ['BEALSVILLE'],
'BARNESVILLE' : ['BARNESVIILE','BARNESVIILE','BARNSVILLE'],
'BROOKEVILLE' : 'BROOKVILLE',
'BURTONSVILLE' : ['BURTOSNVILLE','BURTSONVILLE', 'BUTINSVILLE'],
'CHEVY CHASE': ['CEHVY CHASE', 'CHVEY CHASE', 'CHEVY CHASE #3', 'CHEVY CHASE #4', 'CHEVY CHASE VIEW', 'CHEVY CHASE VILLAGE'],
'CLARKSBURG': ['CALARKSBURG', 'CLAEKSBURG','CLARKESBURG','CLARKSURG', 'CLARSBURG','CLARSKBURG'],
'COMUS': 'COMUS',
'DAMASCUS': 'DANASCUS',
'DERWOOD' : 'DEERWOOD',
'FREDERICK': 'FREDERICK',
'FRIENDSHIP HEIGHTS': 'FRIENDHSIP HEIGHTS',
'GAITHERSBURG' : ['7','GATIHERSBURG', 'GAIHTERSBURG', 'GAITHERBURG', 'GAITHERESBURG','GAITEHRSBURG', 'GITHERSBURG', 'GAIHERSBURG', 'GAISTHERSBURG', 'GAITERSBURG', 'GAITHERBSURG', 'GAITHERSBRG', 'GAITHERSBRUG', 'GAITHERSBUG', 'GAITHERSBUIRG', 'GAITHERSBURT', 'GAITHERSBYRG', 'GAITHERSGURG', 'GAITHERSRBURG', 'GAITHERSSBURG', 'GAITHERSURG', 'GAITHESBURG', 'GAITHESRBURG', 'GAITHRERSBURG', 'GAITHRESBURG', 'GIATHERSBURG', 'GATHERSBURG', 'GAUTHERSBURG'],
'GARRETT PARK' : 'GARRETT PARK',
'GERMANTOWN' : ['4','GRMANTOWN', 'GEMANTOWN', 'GERMATOWN', 'GERMANTNOWN', 'GERMANTOWM', 'GERMANTOOWN', 'GERMANTOEN', 'GERMANTONW', 'GERMANTIWN', 'GEERMANTOWN', 'GERAMNTOWN', 'GERANTOWN', 'GERMAN4TOWN', 'GERMANTOW', 'GERMANTOWNMD','GERMANTWN', 'GERMANTWON', 'GERMNATOWN', 'GERMNTOWN', 'GERRMANTOWN'],
'GLEN ECHO': 'GLEN ECHO`',
'HYATTSTOWN': ['HYATTTOWN','1'],
'KENSINGTON': ['KENSIGNTON', 'KENSINGTOWN', 'KENSONGTON', 'KENSTINGTON', 'KENNSINGTON', 'KENSINGTNO'],
'LAYTONSVILLE': 'LATONSVILLE',
'MONTGOMERY VILLAGE': ['20877','6','MCG','MCGGAITHERSBURG', 'MONT VILLAGE', 'MONTGGOMERY VILLAGE', 'MONTGOMERY VILAGE','MONTGOMERY VILLAE', 'MONTGOMERY VILLLAGE', 'MONTGOMRY VILLAGE', 'MONTOMGERY VILLAGE','MOTGOMERY VILLAGE', 'MOMTGOMERY VILLAGE', 'MONGTOMERY VILLAGE'],
'MOUNT AIRY': ['MT AIRY', 'MT. AIRY'],
'NORTH BETHESDA': ['N BETHESDAQ', 'N. BETHESDA', 'N. POTOMAC', 'NORTH BEHTESDA', 'NORTH BETHSDA','NORTH POTOAMC', 'NOTRTH POTOMAC', 'N BETHESDA'],
'NORTH POTOMAC' : 'N POTOMAC',
'OLNEY': ['ONLEY','ONEY'],
'POOLESVILLE': ['POOLSVILLE','PO'],
'POTOMAC': 'POTIMAC',
'ROCKVILLE': ['3','ROCKIVLLE', 'ROCKVILLLE', 'ROCKVIILLE', 'ROCVILLE', 'ROCVKILLE', 'RCKVILLE', 'ROCKILLE', 'ROCKIVILLE', 'ROCKVIILE', 'ROCKVILEE', 'ROCKVILL', 'ROCKVLLE', 'ROKVILLE', 'ROOCKVILLE', "ROCKVILLE'", 'ROCKVILLE,','ROCKVILE'],
'SANDY SPRING': 'SANDY SPPRING',
'SILVER SPRING': ['2','SILVER SPRING','SLIVER SPRING','SILVE4R SPRING', 'SILVER', 'SILVER APRING', 'SILVER SPING', 'SILVER SPRIING', 'SILVER SPRIND', 'SILVER SPRING', 'SILVER SPRINGQ', 'SILVER SPRIGN','SILVER SRING', 'SILVER SRING', 'SIVLER SPRING', 'SLVER SPRING', 'SIVER SPRING', 'SILER SPRING', 'SILV ER SPRING', 'SILVE SPRING', 'SILVER SPRING', 'SILVER SPIRNG','SILVER SPRIG', 'SILVER SPRIN', 'SILVER SPRIN G', 'SILVER SPRING`', 'SILVER SPRNG','SILVER SPRNIG', 'SILVER SPSRING', 'SILVER SRPING', 'SILVERS SPRING', 'SILVERSPRING','SILVR SPRING'],
'SPENCERVILLE': 'APENCERVILLE',
'TAKOMA PARK': ['TAKOMS PARK','TACOMA PARK', 'TAKOMA','TP'],
'VALLEYWOOD': 'VALLEYWOOD',
'WHEATON': ['WEHATON','WEATON']
}
# Function to correct misspelled city names
def correct_city_name(city):
for correct_name, misspelled_names in city_corrections_manual.items():
if city in misspelled_names:
return correct_name
return city # Return the original city name if not found in the misspelled names
# Apply the correction function to the 'City' column
montgomery_crime_df['City'] = montgomery_crime_df['City'].apply(correct_city_name)
There are some corrupted values, specifically in the Dispatch Date/Time, Start_Date_Time, and Agency columns, that are not feasible to impute. Instead, we will remove them from the dataset.
# Filter out rows where either dispatch date/time or start date time are missing
montgomery_crime_df = montgomery_crime_df[montgomery_crime_df['Dispatch Date / Time'].notnull() & montgomery_crime_df['Start_Date_Time'].notnull()]
# Filter out rows where dispatch date/time is prior to start date time
montgomery_crime_df = montgomery_crime_df[montgomery_crime_df['Dispatch Date / Time'] >= montgomery_crime_df['Start_Date_Time']]
# Drop rows where Agency = P, since this is a meaningless value
montgomery_crime_df.drop(montgomery_crime_df[montgomery_crime_df['Agency'] == 'P'].index, inplace=True)
montgomery_crime_df.drop(montgomery_crime_df[montgomery_crime_df['Agency'] == 'OTHR'].index, inplace=True)
Some of our columns are not the desired types. Lets change that for all 3 columns that represent a timestamp of a crime observation - start, dispatch, and end.
# Convert 'Dispatch Date / Time' to datetime format
montgomery_crime_df['Dispatch Date / Time'] = pd.to_datetime(montgomery_crime_df['Dispatch Date / Time'])
# Convert 'Start_Date_Time' to datetime format
montgomery_crime_df['Start_Date_Time'] = pd.to_datetime(montgomery_crime_df['Start_Date_Time'])
# Convert 'End_Date_Time' to datetime format
montgomery_crime_df['End_Date_Time'] = pd.to_datetime(montgomery_crime_df['End_Date_Time'])
We have some outliers in the latitude and longitude columns - we will filter those out.
# Remove locations that are not within montgomery county dimensions
montgomery_crime_df = montgomery_crime_df[((montgomery_crime_df['Latitude'] > 30) & (montgomery_crime_df['Latitude'] < 50)
& (montgomery_crime_df['Longitude'] > -85)
& (montgomery_crime_df['Longitude'] < -70))]
Lastly, our data covers crimes ranging from 2016 to November 8th, 2023. We decided to filter out any crime data that occurred prior to 2016, as the data quality of those observations is low (missing crime categorizations), possibly due to still figuring out the system of data collection.
# filter out any observations that occurred prior to 2017
montgomery_crime_df = montgomery_crime_df[montgomery_crime_df['Start_Date_Time'] > '2017-01-01']
Now that we've done that, we want to perform some grouping operations that will be used later on in our analysis. The Start Time, which we have defined as when a crime is reported by either a citizen or the police, is what we will focus on as it is the best representation of "when" a crime occurred.
# Extract month and year information from 'Start_Date_Time'
montgomery_crime_df['Month'] = montgomery_crime_df['Start_Date_Time'].dt.month
montgomery_crime_df['Year'] = montgomery_crime_df['Start_Date_Time'].dt.year
# Mapping numerical months to corresponding names for readability
month_names = {1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June', 7: 'July',
8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December'}
# Define the order of months for categorical ordering
month_order = [month for month in month_names.values()]
# Create a new column 'Month_Name' and map the month names
montgomery_crime_df['Month_Name'] = montgomery_crime_df['Month'].map(month_names)
# Set 'Month_Name' as a categorical variable w|ith the specified order
montgomery_crime_df['Month_Name'] = pd.Categorical(montgomery_crime_df['Month_Name'], categories=month_order, ordered=True)
# Group by 'Year' and 'Month_Name', and count the occurrences of crimes
incident_groups = montgomery_crime_df.groupby(['Year', 'Month_Name'], observed=True)['Crime Name1'].count()
# Extract Hour information into separate column
montgomery_crime_df['Hour'] = montgomery_crime_df['Start_Date_Time'].dt.hour
To answer our questions about crime trends in Montgomery County, we should also analyze the time series data for different categories of crime. We will do this by creating a time series line graph that plots the total number of crimes taking place on a monthly basis. We will also take a cross section of the data for each of the 3 sub-categories of crime - Crime Against Person, Crime Against Property, and Crime Against Society.
Crime Against Person is defined as incidents where the victims are individuals, and includes assault, homicide, and kidnapping.
Crime Against Property is defined as an incident where the motivation is to obtain some kind of benefit, like money or property, and includes breaking & entering, car theft, robbery, and shoplifting.
Crime Against Society is defined as an incident that violates society’s prohibition against engaging in certain types of activity and is typically victimless crimes, and includes drug/narcotics violations, driving under the influence, and disorderly conduct (fbi.gov).
# Create dummy variables for crime category
crime_name_dummies = pd.get_dummies(montgomery_crime_df['Crime Name1'], prefix='dummy', prefix_sep='_', dummy_na=False, columns=['Crime Name1'], dtype='int')
# merge dummy variables with main dataframe
montgomery_crime_df = montgomery_crime_df.merge(crime_name_dummies, on='Incident ID');
# Create new column that does not contain time values
montgomery_crime_df['Day'] = montgomery_crime_df['Start_Date_Time'].dt.date
# Count the number of crimes in each month
group_by_date = montgomery_crime_df.groupby('Day', observed=True)['Crime Name1'].count()
group_by_date.index = pd.to_datetime(group_by_date.index)
group_by_month = group_by_date.resample('1M').sum()
# Sum the number of crimes against person for each month in dataset
person_group_by_date = montgomery_crime_df.groupby('Day')['dummy_Crime Against Person'].sum()
# Convert to datetime
person_group_by_date.index = pd.to_datetime(group_by_date.index)
# Resample dataset to aggregate counts based on month
person_group_by_month = person_group_by_date.resample('1M').sum()
# Drop final month since it is incomplete
person_group_by_month.drop(person_group_by_month.tail(1).index,inplace=True)
# Sum the number of crimes against person for each month in dataset
property_group_by_date = montgomery_crime_df.groupby('Day')['dummy_Crime Against Property'].sum()
# Convert to datetime
property_group_by_date.index = pd.to_datetime(group_by_date.index)
# Resample dataset to aggregate counts based on month
property_group_by_month = property_group_by_date.resample('1M').sum()
# Drop final month since it is incomplete
property_group_by_month.drop(property_group_by_month.tail(1).index,inplace=True)
# Sum the number of crimes against person for each month in dataset
society_group_by_date = montgomery_crime_df.groupby('Day')['dummy_Crime Against Society'].sum()
# Convert to datetime
society_group_by_date.index = pd.to_datetime(group_by_date.index)
# Resample dataset to aggregate counts based on month
society_group_by_month = society_group_by_date.resample('1M').sum()
# Drop final month since it is incomplete
society_group_by_month.drop(society_group_by_month.tail(1).index,inplace=True)
# Drop the final value in the total crime rate series
group_by_month.drop(group_by_month.tail(1).index,inplace=True)
# Initialize values for use in our animated plot
crime_rate = group_by_month.values
t = group_by_month.index
x,y = [], []
person_crime_rate = person_group_by_month.values
y1 = []
property_crime_rate = property_group_by_month.values
y2 = []
society_crime_rate = society_group_by_month.values
y3 = []
%%capture
# create matplotlib figure and axes
fig = plt.figure(figsize=(14,8));
axes = fig.add_subplot(1,1,1);
# set initial values and configuration
axes.set_ylim(0, 3000)
axes.set_title('Montgomery County Crime Rate', fontsize=20)
axes.set_xlabel('Month/Year')
axes.set_ylabel('Number of Crimes')
crime_rate_plot = plt.plot([], [])
# create legend for plot
all_crime = plt.plot(x,y, color='black', marker='*', scaley=True, scalex=True, label='All Crime')
person_crime = plt.plot(x,y1, 'r-.', scaley=True, scalex=True, label='Crime Against Person')
property_crime = plt.plot(x,y2, 'b--', scaley=True, scalex=True, label='Crime Against Property')
society_crime = plt.plot(x,y3, 'g:', scaley=True, scalex=True, label='Crime Against Society')
legend = plt.legend(loc=1)
# Define function that will animate the plot
def animate(i):
# append new values to the arrays that will be plotted
x.append(t[i])
y.append(crime_rate[i])
y1.append(person_crime_rate[i])
y2.append(property_crime_rate[i])
y3.append(society_crime_rate[i])
# plot crime rates
all_crime = plt.plot(x,y, color='black', marker='*', scaley=True, scalex=True)
person_crime = plt.plot(x,y1, 'r-.', scaley=True, scalex=True)
property_crime = plt.plot(x,y2, 'b--', scaley=True, scalex=True)
society_crime = plt.plot(x,y3, 'g:', scaley=True, scalex=True)
# ADD ARROWS AND ANIMATIONS
if i > 40:
# plot marker for first US case of covid 19
plt.text(dt.datetime.strptime('03/25/2019', '%m/%d/%Y'), 900, 'First Covid-19\ncase reported\nin U.S.', color='grey', fontsize='large', alpha=0.1)
plt.plot([dt.datetime.strptime('01/20/2020', '%m/%d/%Y')] * 100, range(0, 4000, 40), '--', linewidth=1, alpha=0.01)
# plot marker for official start of covid pandemic
plt.text(dt.datetime.strptime('03/20/2020', '%m/%d/%Y'), 800, 'WHO declares\nCovid-19 Pandemic', color='grey', fontsize='large', alpha=0.1)
plt.plot([dt.datetime.strptime('03/11/2020', '%m/%d/%Y')] * 100, range(0, 4000, 40), '--', linewidth=1, alpha=0.01)
# set higher animation size limit
matplotlib.rcParams['animation.embed_limit'] = 2**128
# create animation using animate function
anim = FuncAnimation(fig=fig, func=animate, frames=len(group_by_month), interval=400, repeat=False)
# show plot and enable viewing in html
display(HTML(anim.to_jshtml()))
# Automatically play the animation once it is generated in HTML
Javascript('document.querySelector(".anim-buttons > button:nth-child(6)").click()')